fix: real liveness probe on /health (Bug #8) #54

Open
avrabe wants to merge 1 commit into main from fix/health-real-liveness


@avrabe avrabe commented May 2, 2026

Summary

  • New `src/health.js` module with `evaluateHealth({scheduler, kv, dataDir})`.
  • Both `/health` endpoints (Express router + Probot v14 addHandler) now return 503 when any check fails.
  • Closes Bug #8 from `docs/agent-fleet/bugs.md`.
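As a rough illustration of the 200/503 behaviour described above, the sketch below wraps an evaluation function in an Express-style handler. It is not the code from this PR: `makeHealthHandler` is a hypothetical helper, `evaluate` stands in for a call to `evaluateHealth`, and the result shape (`ok`, `state`, `checks`) is an assumption. It also assumes the evaluation is synchronous. The Probot v14 `addHandler` wiring is not shown.

```javascript
// Hypothetical helper: builds an Express-style (req, res) handler.
// `evaluate` is assumed to return { ok: boolean, state: string, checks: object }.
function makeHealthHandler(evaluate) {
  return function healthHandler(req, res) {
    const result = evaluate();
    // 503 only when the overall result is not ok; the body is always sent
    // so logs and dashboards can see which check tripped.
    res.status(result.ok ? 200 : 503).json(result);
  };
}
```

With a real Express router this could be mounted as `router.get('/health', makeHealthHandler(...))`, with the argument closing over the scheduler, kv, and data directory.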

What's checked

| Probe | Signal | On fail |
|---|---|---|
| scheduler | last tick ≤ 2× interval ago | unhealthy → 503 |
| kv (SQLite) | `SELECT 1` succeeds | unhealthy → 503 |
| disk | data dir free bytes ≥ 100 MB | unhealthy → 503 |
| any | probe throws (probe itself broken) | degraded → 200 |
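The three checks can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `src/health.js`: the function names `checkScheduler`, `checkKv`, and `checkDisk` are hypothetical, the result strings `'pass'`, `'fail'`, `'error'`, and `'missing'` are assumed labels, "100 MB" is taken to mean 100 × 1024 × 1024 bytes, and the disk check uses Node's `fs.statfsSync` (available in recent Node versions) with an injectable replacement for testing.

```javascript
const fs = require('node:fs');

// Assumption: "100 MB" means 100 MiB.
const MIN_FREE_BYTES = 100 * 1024 * 1024;

// Scheduler: 'fail' when the last tick is older than 2x the interval.
// Assumption: getLastTickAt() returns null before the first tick.
function checkScheduler(scheduler, now = Date.now()) {
  const lastTickAt = scheduler.getLastTickAt();
  if (lastTickAt == null) return 'missing';
  return now - lastTickAt > 2 * scheduler.getIntervalMs() ? 'fail' : 'pass';
}

// KV: ping() is expected to throw on a locked or broken database,
// so a throw here counts as a failed check rather than a broken probe.
function checkKv(kv) {
  try {
    kv.ping();
    return 'pass';
  } catch {
    return 'fail';
  }
}

// Disk: if statfs itself throws (e.g. missing path) the probe is broken,
// which is reported as 'error' rather than 'fail'.
function checkDisk(dataDir, statfs = fs.statfsSync) {
  let stats;
  try {
    stats = statfs(dataDir);
  } catch {
    return 'error';
  }
  // bavail = blocks available to unprivileged users, bsize = block size in bytes.
  const freeBytes = stats.bavail * stats.bsize;
  return freeBytes >= MIN_FREE_BYTES ? 'pass' : 'fail';
}
```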

Why three states

`degraded` exists for "the probe itself broke" cases (e.g. `statfs` throws on a missing path). Returning 200 keeps PM2 from restarting (which wouldn't fix the underlying probe issue) while still flagging it visibly in the response body.
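One way to picture the three-state logic is a most-severe-wins fold over the per-check results. The sketch below is an assumption-laden illustration rather than the PR's implementation: `summarize`, `stateForCheck`, and the result strings `'pass'`, `'fail'`, `'error'`, and `'missing'` are hypothetical names.

```javascript
// Higher number = more severe.
const SEVERITY = { healthy: 0, degraded: 1, unhealthy: 2 };

// Map one check result to an overall state.
function stateForCheck(result) {
  if (result === 'pass') return 'healthy';
  if (result === 'fail') return 'unhealthy';
  return 'degraded'; // 'error', 'missing', or anything unrecognised
}

// Fold all checks into a single state; the most severe one wins.
function summarize(checks) {
  let state = 'healthy';
  for (const result of Object.values(checks)) {
    const candidate = stateForCheck(result);
    if (SEVERITY[candidate] > SEVERITY[state]) state = candidate;
  }
  const ok = state !== 'unhealthy';
  return { state, ok, httpStatus: ok ? 200 : 503, checks };
}
```

Under these assumptions, `summarize({ scheduler: 'pass', kv: 'pass', disk: 'error' })` yields `degraded` with status 200, while any single `'fail'` yields `unhealthy` with status 503.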

Test plan

  • 12 new unit tests in `tests/unit/health.test.js`
  • Full suite: 846 pass (was 834)
  • `npm run lint` clean
  • Live-test on netcup after deploy: kill the scheduler tick handler, verify /health returns 503

Risk

Medium. This is an intentional behaviour change: `/health` can now return non-200. External monitors that previously treated temper as always-healthy should be re-checked, but the whole point is that they should now react to real outages (for example by restarting the bot).

🤖 Generated with Claude Code

Why: DevOps/SRE flagged that /health returned 200 unconditionally.
PM2's restart-on-failed-healthcheck and any external uptime monitor
were both blind to a hung scheduler, a locked SQLite DB, or a full
data disk. The bot could be silently dead for hours.

What:
  - New module src/health.js: pure-function evaluateHealth(probes)
    with three checks (scheduler tick freshness, SQLite ping, disk
    free) and three states (healthy / degraded / unhealthy).
  - Scheduler exposes getLastTickAt() and getIntervalMs(); tick
    timestamp is updated in finally so even error paths refresh it.
    /health flags the scheduler as 'fail' when the last tick is older
    than 2× the interval.
  - persistent-kv exposes ping() — a SELECT 1 against the open
    Database. Throws on locked / broken file.
  - app.js wires both /health endpoints (Express router + Probot v14
    addHandler) through evaluateHealth and returns 503 when
    probe.ok=false. The body always includes a `checks` map so PM2
    logs / dashboards can see *which* check tripped.
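
The "updated in finally" behaviour of the scheduler can be sketched as below. This is a minimal illustration, not the scheduler from this commit: `createScheduler` and its `task` parameter are hypothetical, and logging errors with `console.error` is an assumption. Only `getLastTickAt()` and `getIntervalMs()` are names taken from the description above.

```javascript
// Hypothetical scheduler factory; `task` is the work done on each tick.
function createScheduler(task, intervalMs) {
  let lastTickAt = null; // null until the first tick has completed
  let timer = null;

  async function tick() {
    try {
      await task();
    } catch (err) {
      // Assumption: errors are logged and do not stop the scheduler.
      console.error('scheduler tick failed:', err);
    } finally {
      // Runs on both success and error paths, so a failing task still
      // counts as a live scheduler; only a hung tick lets this go stale.
      lastTickAt = Date.now();
    }
  }

  return {
    tick,
    start() { timer = setInterval(tick, intervalMs); },
    stop() { clearInterval(timer); },
    getLastTickAt: () => lastTickAt,
    getIntervalMs: () => intervalMs,
  };
}
```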

The three states:
  healthy:    every check passed                              → 200
  degraded:   a probe missing, threw, or scheduler had no
              tick yet — not actionable but visible            → 200
  unhealthy:  a check failed in a way that means the bot
              cannot do its job (DB ping throws, scheduler
              hung, disk full)                                 → 503

Test plan:
  - 12 new unit tests in __tests__/unit/health.test.js covering each
    probe's pass/fail/error/missing branches plus the most-severe-wins
    aggregation rule.
  - Full suite: 846 pass (was 834), lint clean.
  - Existing /health integration tests still pass — when probes are
    not injected (test setup leaves them null) evaluateHealth returns
    healthy so the 200 path is unchanged.

Risk: medium. The behaviour change is intentional: /health can now
return 503. Any external monitor that currently treats temper as
always-healthy should be re-checked, but the whole point is that they
should now react to real outages (for example by restarting the bot).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>